End-to-End Pipeline for Building a Simple Lakehouse Analytics Solution

How To Organize Your Data in Databricks Lakehouse

Medallion Architecture
AI/BI dashboard
Business intelligence
Unity Catalog
Author

Phuong Nguyen

Published

January 14, 2025

Hello,

This post provides an understanding of data organization in Azure Databricks and guides you step-by-step through a use case to build a lakehouse analytics in Unity Catalog of Azure Databricks and use the curated data for dashboard design. We’ll use a dataset available at a URL, employing the medallion architecture.

Understanding data structure in Azure Databricks

Typically, one Unity Catalog is one meta store per Azure subscription or AWS tenant. The Unity Catalog served as a unified organizational structure that holds all data within the entire organization. Data is organized within Unity Catalog into catalogs -> schemas -> table.

•   Catalogs: Function like databases, holding multiple schemas.

•   Schemas: Contain multiple tables.

•   Tables: The organized storage of data

The medallion architecture provides a way to organize data in the lakehouse for managing and analyzing large, diverse data. It is structured by three primary layers, corresponding to each specific stage in the data handling process:

Bronze Layer (Raw Layer)

This layer includes raw data from external source systems. The table structures in this layer mirror the “as-is” structure of the source systems. The tables are designed to maintain their original format, reflecting the data exactly as it is in the source systems.

Silver Layer (Enriched Data)

The table structures in this layer mirror the “as-is” structure of the source systems. The tables are designed to maintain their original format, reflecting the data exactly as it is in the source systems.

Gold layer (curated data)

In this layer, data is organized in consumption-ready “project-specific” database. Data in Gold layer is highly curated and prepared for specific business use case such as analytics or reporting.

In summary, the data is curated as it moves through the different layers of a lakehouse. The lake house architecture enable data to be easy to understand and implement.

Use case: Building a simple Lakehouse analytics with data retrieved from an URL

Description

You have a dataset available on an external data source (Github). It is an academic Android malware classification dataset, given by Canadian Institute for Cybersecurity (CIC) and Canadian Centre for Cybersecurity (CCCS). The mission is to retrieve such dataset to your Unity Catalog and organize the data in different layer: Bronze, Silver and Gold. Then, to project the statistics on the impact of malware on mobile phones By the end of this use case, you will have a dashboard published to answer questions on the most affecting malware for mobile phone.

The guide step to step

Before we get started, it need to have an All-purpose compute resource. As in this demo, I run both Spark and SQL code in the same notebook. This is typically used for interactive data analysis in notebooks and supports running multiple programming languages. For the pipeline codes simply in SQL, select the option SQL warehouses. From the Databricks work space, click on Compute -> Select the tab All-purpose compute -> Click Create compute then following the instruction to configure the compute.

Step 1: Create Schema

  • In your Databricks workspace, click Catalog
  • Search and click on your catalog, e.g. azure_databricks_demo
  • Click the Create Schema button
  • Enter a name for the schema. e.g. malware
  • Click Create.

Step 2: Create Volume

  • Search or browse to the schema malware
  • Click Create Volume
  • Enter a name for the volume. e.g. malware_volume
  • Provide comment (optional)
  • Click Create.

Step 3. Create a notebook and add Spark - SQL pipeline code

1. Download dataset from external datasource (Github)

  • Click on New in the left side bar, then select Notebook. Rename it to “Malware_Pipeline_Spark_SQL”
  • In the cell code, change the language in cell to Python
  • Copy and paste the following code in Python
# Define the URL and the destination path in the volume
url = "https://raw.githubusercontent.com/Phuong-NTL/Dataset-collection/refs/heads/main/malware_dataset.csv"

# Define the catalog, schema, and volume
catalog_name = "azure_databricks_demo"
schema_name = "malware"
volume_name = "malware_volume"

# Construct the path to save the file in the volume
volume_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/raw_malware_dataset.csv"

# Use dbutils to download the file to the volume
dbutils.fs.cp(url, volume_path)

# Verify the file is downloaded
display(dbutils.fs.ls(f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/"))

2. Read the file into a DataFrame

# Read the raw file to dataframe
df = spark.read.format("csv") \
        .option("header", True) \
        .option("delimiter", ",") \
        .option("escape", "\\") \
        .load(volume_path)
display(df)
print(df.count())

3. Write to catalog

# Define the catalog and schema
catalog_name = "azure_databricks_demo"
schema_name = "malware"
table_name = "malware_raw"

# Create the catalog and schema if they do not exist
# spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog_name}")
# spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog_name}.{schema_name}")

# Save the DataFrame as a table in the specified catalog and schema
df.write.format("delta").saveAsTable(f"{catalog_name}.{schema_name}.{table_name}")

Copy and paste the following code in SQL

SELECT * FROM azure_databricks_demo.malware.malware_raw

Organization of data

Bronze layer

-- Bronze layer: Raw data ingestion
CREATE OR REPLACE TABLE azure_databricks_demo.malware.malware_bronze_raw 
AS
SELECT *
FROM azure_databricks_demo.malware.malware_raw

Silver layer

-- Silver layer: Selected columns
CREATE OR REPLACE TABLE azure_databricks_demo.malware.malware_silver_selected_features 
AS
SELECT
    Category,
    Family,
    Process_total, 
    Battery_wakelock, 
    Memory_HeapSize,
    Memory_HeapAlloc,
    Memory_HeapFree
FROM azure_databricks_demo.malware.malware_bronze_raw

Gold layer

-- Gold layer: Max Process_total per Category
CREATE OR REPLACE TABLE azure_databricks_demo.malware.malware_gold_max_processes 
AS
SELECT
    Category,
    MAX(CAST(Process_total AS INT)) AS Total_Processes_Breached
FROM malware.malware_silver_selected_features
GROUP BY Category

Step 4: Discover data using Catalog Explorer

Explorer and verify the dataset generated through Unity Catalog via the path structure catalog_name/schema_name/table_name. - Click Catalog in the left sidebar - Click schema_name - Click table_name - Click on the tab Sample Data to load the table detail - Click on Lineage then click See lineage graph to see the overview of all the upstream and downstream tables of the current table.

Step 5: Create a Dashboard

  • On the current table, click on the button Create on the right side then select Dashboard. It automatically create a dashboard template on the current table.
  • Using Ask the Assistant and give a prompt, e.g. “create a line chart on Category and Total_Processes_Breached” . Click Accept to save the AI-generated chart.
  • Customize the chart using configuration option in the right.

Step 6: Publish and distribute your dashboard

  • Click Publish in the upper right corner to create a lean copy of the dashboard.
  • Click Publish. The dashboard now is ready to be shared to the others.

For a hands-on practice, watch my step-by-step tutorial video demonstrating the above process:

References

[1] https://community.databricks.com/t5/get-started-guides/getting-started-with-databricks-build-a-simple-lakehouse/ta-p/67404 [2] Canadian Institute for Cybersecurity (CIC) in collaboration with the Canadian Centre for Cybersecurity (CCCS)